Abstract

Entropy regularization is a straightforward and successful method of
semi-supervised learning that augments the traditional conditional
likelihood objective function with an additional term that aims to
minimize the predicted label entropy on unlabeled data. It has
previously been demonstrated to provide positive results in
linear-chain CRFs, but the published method for calculating the
entropy gradient requires significantly more computation than
supervised CRF training. This paper presents a new derivation and
dynamic program for calculating the entropy gradient that is
significantly more efficient, having the same asymptotic time
complexity as supervised CRF training. We also present efficient
generalizations of this method for calculating the label entropy of
all sub-sequences, which is useful for active learning, among other
applications.

1  Introduction 

Semi-supervised learning is of growing importance in machine learning
and NLP (Zhu, 2005). Conditional random fields (CRFs) (Lafferty et
al., 2001) are an appealing target for semi-supervised learning
because they achieve state-of-the-art performance across a broad
spectrum of sequence labeling tasks, and yet, like many other machine
learning methods, training them by supervised learning typically
requires large annotated data sets.


Entropy regularization (ER) is a method of semi-supervised learning
first proposed for classification tasks (Grandvalet and Bengio, 2004).
In addition to maximizing conditional likelihood of the available
labels, ER also aims to minimize the entropy of the predicted label
distribution on unlabeled data. By insisting on peaked, confident
predictions, ER guides the decision boundary away from dense regions
of input space. It is simple and compelling: no pre-clustering, no
"auxiliary functions," tuning of only one meta-parameter, and it is
discriminative.

Jiao et al. (2006) apply this method to linear-chain CRFs and
demonstrate encouraging accuracy improvements on a gene-name-tagging
task. However, the method they present for calculating the gradient
of the entropy takes substantially greater time than the traditional
supervised-only gradient. Whereas supervised training requires only
classic forward/backward, taking time $O(ns^2)$ (sequence length times
the square of the number of labels), their training method takes
$O(n^2 s^3)$, a factor of $O(ns)$ more. This greatly reduces the
practicality of using large amounts of unlabeled data, which is
exactly the desired use-case.

This paper presents a new, more efficient entropy gradient derivation
and dynamic program that has the same asymptotic time complexity as
the gradient for traditional CRF training, $O(ns^2)$. In order to
describe this calculation, the paper introduces the concept of
subsequence constrained entropy: the entropy of a CRF for an observed
data sequence when part of the label sequence is fixed. These methods
will allow training on larger unannotated data set sizes than
previously possible and support active learning.


2  Semi-Supervised  CRF  Training 


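Lafferty et al. (2001) present linear-chain CRFs, a discriminative
probabilistic model over observation sequences $x$ and label sequences
$Y = \langle Y_1..Y_n \rangle$, where $|x| = |Y| = n$, and each label
$Y_i$ has $s$ different possible discrete values. For a linear-chain
CRF of Markov order one:

$$p_\theta(Y|x) = \frac{1}{Z(x)} \exp\Big(\sum_k \theta_k F_k(x,Y)\Big),$$

where $F_k(x,Y) = \sum_i f_k(x, Y_i, Y_{i+1}, i)$, and the partition
function $Z(x) = \sum_Y \exp\big(\sum_k \theta_k F_k(x,Y)\big)$. Given
training data $D = \langle d_1..d_n \rangle$, the model is trained by
maximizing the log-likelihood of the data
$L(\theta; D) = \sum_d \log p_\theta(Y^{(d)}|x^{(d)})$ by gradient
methods (e.g. Limited Memory BFGS), where the gradient of the
likelihood is:

$$\frac{\partial}{\partial\theta_k} L(\theta; D) = \sum_d F_k(x^{(d)}, Y^{(d)}) - \sum_d \sum_Y p_\theta(Y|x^{(d)}) F_k(x^{(d)}, Y).$$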


The second term (the expected counts of the features given the model)
can be computed in a tractable amount of time, since according to the
Markov assumption, the feature expectations can be rewritten:

$$\sum_Y p_\theta(Y|x) F_k(x,Y) = \sum_i \sum_{Y_i, Y_{i+1}} p_\theta(Y_i, Y_{i+1}|x) f_k(x, Y_i, Y_{i+1}, i).$$


A dynamic program (the forward/backward algorithm) then computes in
time $O(ns^2)$ all the needed probabilities
$p_\theta(Y_i, Y_{i+1}|x)$, where $n$ is the sequence length, and $s$
is the number of labels.
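The forward/backward pass described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `log_psi` representation (one table of log-potentials per transition, standing in for $\sum_k \theta_k f_k(x, Y_i, Y_{i+1}, i)$) are assumptions made for the example, and indices are 0-based.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def pairwise_marginals(log_psi):
    """log_psi[i][a][b] is the (hypothetical) log-potential of Y_i=a, Y_{i+1}=b.

    Returns (marg, log_z), where marg[i][a][b] = p(Y_i=a, Y_{i+1}=b | x)
    and log_z = log Z(x). Each position costs O(s^2), so the total is O(n s^2).
    """
    m, s = len(log_psi), len(log_psi[0])
    n = m + 1
    alpha = [[0.0] * s for _ in range(n)]  # forward log-scores
    beta = [[0.0] * s for _ in range(n)]   # backward log-scores
    for i in range(1, n):
        for b in range(s):
            alpha[i][b] = logsumexp(
                [alpha[i - 1][a] + log_psi[i - 1][a][b] for a in range(s)])
    for i in range(n - 2, -1, -1):
        for a in range(s):
            beta[i][a] = logsumexp(
                [log_psi[i][a][b] + beta[i + 1][b] for b in range(s)])
    log_z = logsumexp(alpha[n - 1])
    marg = [[[math.exp(alpha[i][a] + log_psi[i][a][b] + beta[i + 1][b] - log_z)
              for b in range(s)] for a in range(s)] for i in range(m)]
    return marg, log_z
```

Working in log space avoids underflow on long sequences; the marginals returned here are the only quantities the later entropy computations need from the model.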

For semi-supervised training by entropy regularization, we change the
objective function by adding the negative entropy of the unannotated
data $U = \langle u_1..u_n \rangle$. (Here a Gaussian prior is also
shown.)

$$L(\theta; D, U) = \sum_d \log p_\theta(Y^{(d)}|x^{(d)}) - \sum_k \frac{\theta_k^2}{2\sigma^2} + \lambda \sum_u \sum_Y p_\theta(Y|x^{(u)}) \log p_\theta(Y|x^{(u)}).$$


This  negative  entropy  term  increases  as  the  decision 
boundary  is  moved  into  sparsely-populated  regions 
of  input  space. 
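To make the three parts of the entropy-regularized objective concrete, the following sketch evaluates it by brute-force enumeration on toy inputs. It is only practical for very short sequences and is meant as a reference point, not a training routine. The names `log_probs` and `objective`, the nested-list feature layout, and the parameters `lam` (the entropy weight) and `sigma2` (the prior variance) are all assumptions of this example.

```python
import itertools
import math

def log_probs(theta, feats):
    """Return {Y: log p_theta(Y|x)} over all s^n label sequences.

    feats[i][a][b] is the (hypothetical) feature vector f(x, Y_i=a, Y_{i+1}=b, i).
    """
    n, s = len(feats) + 1, len(feats[0])
    scores = {}
    for y in itertools.product(range(s), repeat=n):
        scores[y] = sum(t * f
                        for i in range(n - 1)
                        for t, f in zip(theta, feats[i][y[i]][y[i + 1]]))
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {y: v - log_z for y, v in scores.items()}

def objective(theta, labeled, unlabeled, lam, sigma2):
    """Log-likelihood minus Gaussian prior plus lam times negative entropy.

    labeled is a list of (feats, y) pairs; unlabeled is a list of feats.
    """
    log_lik = sum(log_probs(theta, feats)[tuple(y)] for feats, y in labeled)
    prior = sum(t * t for t in theta) / (2.0 * sigma2)
    neg_entropy = sum(math.exp(lp) * lp
                      for feats in unlabeled
                      for lp in log_probs(theta, feats).values())
    return log_lik - prior + lam * neg_entropy
```

With all parameters at zero the model is uniform, so both the log-likelihood and the negative entropy of a length-3, two-label sequence equal $\log(1/8)$, which gives a quick sanity check.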

3  An  Efficient  Form  of  the  Entropy 
Gradient 

In  order  to  maximize  the  above  objective  function, 
the  gradient  for  the  entropy  term  must  be  computed. 
Jiao  et  al.  (2006)  perform  this  computation  by: 

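$$\frac{\partial}{\partial\theta} -H(Y|x) = \mathrm{cov}_{p_\theta(Y|x)}[F(x,Y)]\,\theta,$$

where

$$\mathrm{cov}_{p_\theta(Y|x)}[F_j(x,Y), F_k(x,Y)] = E_{p_\theta(Y|x)}[F_j(x,Y)\,F_k(x,Y)] - E_{p_\theta(Y|x)}[F_j(x,Y)]\,E_{p_\theta(Y|x)}[F_k(x,Y)].$$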


While the second term of the covariance is easy to compute, the first
term requires calculation of quadratic feature expectations. The
algorithm they propose to compute this term is $O(n^2 s^3)$ as it
requires an extra nested loop in forward/backward.

However,  the  above  form  of  the  gradient  is  not 
the  only  possibility.  We  present  here  an  alternative 
derivation  of  the  gradient: 

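$$\begin{aligned}
\frac{\partial}{\partial\theta_k} -H(Y|x)
&= \frac{\partial}{\partial\theta_k} \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) \\
&= \sum_Y \Big(\frac{\partial}{\partial\theta_k} p_\theta(Y|x)\Big) \log p_\theta(Y|x)
 + p_\theta(Y|x) \Big(\frac{\partial}{\partial\theta_k} \log p_\theta(Y|x)\Big) \\
&= \sum_Y p_\theta(Y|x) \Big(F_k(x,Y) - \sum_{Y'} p_\theta(Y'|x) F_k(x,Y')\Big) \log p_\theta(Y|x) \\
&\quad + \sum_Y p_\theta(Y|x) \Big(F_k(x,Y) - \sum_{Y'} p_\theta(Y'|x) F_k(x,Y')\Big).
\end{aligned}$$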


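Since $\sum_Y p_\theta(Y|x) \sum_{Y'} p_\theta(Y'|x) F_k(x,Y') =
\sum_{Y'} p_\theta(Y'|x) F_k(x,Y')$, the second summand cancels,
leaving:

$$\frac{\partial}{\partial\theta_k} -H(Y|x) = \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) F_k(x,Y) - \Big(\sum_Y p_\theta(Y|x) \log p_\theta(Y|x)\Big) \Big(\sum_{Y'} p_\theta(Y'|x) F_k(x,Y')\Big).$$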


Like the gradient obtained by Jiao et al. (2006), there are two terms,
and the second is easily computable given the feature expectations
obtained by forward/backward and the entropy for the sequence.
However, unlike the previous method, here the first term can be
efficiently calculated as well. First, the term must be further
factored into a form more amenable to analysis:

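$$\begin{aligned}
\sum_Y p_\theta(Y|x) \log p_\theta(Y|x) F_k(x,Y)
&= \sum_Y p_\theta(Y|x) \log p_\theta(Y|x) \sum_i f_k(x, Y_i, Y_{i+1}, i) \\
&= \sum_i \sum_{Y_i, Y_{i+1}} f_k(x, Y_i, Y_{i+1}, i) \sum_{Y_{-(i..i+1)}} p_\theta(Y|x) \log p_\theta(Y|x).
\end{aligned}$$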

Here, $Y_{-(i..i+1)} = \langle Y_{1..(i-1)} Y_{(i+2)..n} \rangle$. In
order to efficiently calculate this term, it is sufficient to
calculate $\sum_{Y_{-(i..i+1)}} p_\theta(Y|x) \log p_\theta(Y|x)$ for
all pairs $y_i, y_{i+1}$. The next section presents a dynamic program
which can perform these computations in $O(ns^2)$.
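The two-term form of the gradient derived in this section can be checked numerically on a toy model. The sketch below enumerates every label sequence, evaluates the two terms directly, and can be compared against finite differences of the negative entropy. It is exponential in the sequence length and only serves as a reference check. The function names and the nested-list feature layout are assumptions of this example.

```python
import itertools
import math

def enumerate_crf(theta, feats):
    """Brute force over all s^n label sequences.

    feats[i][a][b] is the (hypothetical) feature vector f(x, Y_i=a, Y_{i+1}=b, i).
    Returns a list of (F, p): the global feature vector F(x, Y) and p_theta(Y|x).
    """
    n, s, K = len(feats) + 1, len(feats[0]), len(theta)
    rows = []
    for y in itertools.product(range(s), repeat=n):
        F = [sum(feats[i][y[i]][y[i + 1]][k] for i in range(n - 1))
             for k in range(K)]
        rows.append((F, sum(t * f for t, f in zip(theta, F))))
    z = sum(math.exp(score) for _, score in rows)
    return [(F, math.exp(score) / z) for F, score in rows]

def neg_entropy(theta, feats):
    """-H(Y|x) = sum_Y p log p."""
    return sum(p * math.log(p) for _, p in enumerate_crf(theta, feats))

def neg_entropy_gradient(theta, feats):
    """First term minus (negative entropy) times (expected feature count)."""
    rows = enumerate_crf(theta, feats)
    total = sum(p * math.log(p) for _, p in rows)
    grad = []
    for k in range(len(theta)):
        first = sum(p * math.log(p) * F[k] for F, p in rows)
        expected = sum(p * F[k] for F, p in rows)
        grad.append(first - total * expected)
    return grad
```

Comparing `neg_entropy_gradient` with a central difference of `neg_entropy` in each coordinate of `theta` is a convenient way to test a faster implementation before trusting it in training.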

4  Subsequence  Constrained  Entropy 

We  define  subsequence  constrained  entropy  as 

$$H^\sigma(Y_{-(a..b)}|y_{a..b}, x) = \sum_{Y_{-(a..b)}} p_\theta(Y_{-(a..b)}|y_{a..b}, x) \log p_\theta(Y_{-(a..b)}|y_{a..b}, x).$$

The key to the efficient calculation for all subsets is to note that
the entropy can be factored given a linear-chain CRF of Markov order
1, since $Y_{i+2}$ is independent of $Y_i$ given $Y_{i+1}$.

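$$\begin{aligned}
&\sum_{Y_{-(a..b)}} p_\theta(Y_{-(a..b)}, y_{a..b}|x) \log p_\theta(Y_{-(a..b)}, y_{a..b}|x) \\
&= \sum_{Y_{-(a..b)}} p_\theta(y_{a..b}|x)\, p_\theta(Y_{-(a..b)}|y_{a..b}, x) \times
   \big(\log p_\theta(y_{a..b}|x) + \log p_\theta(Y_{-(a..b)}|y_{a..b}, x)\big) \\
&= p_\theta(y_{a..b}|x) \log p_\theta(y_{a..b}|x)
   + p_\theta(y_{a..b}|x)\, H^\sigma(Y_{-(a..b)}|y_{a..b}, x) \\
&= p_\theta(y_{a..b}|x) \log p_\theta(y_{a..b}|x)
   + p_\theta(y_{a..b}|x)\, H^\alpha(Y_{1..(a-1)}|y_a, x)
   + p_\theta(y_{a..b}|x)\, H^\beta(Y_{(b+1)..n}|y_b, x).
\end{aligned}$$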

Given the $H^\alpha(\cdot)$ and $H^\beta(\cdot)$ lattices, any sequence
entropy can be computed in constant time. Figure 1 illustrates an
example in which the constrained sequence is of size two, but the
method applies to arbitrary-length contiguous label sequences.

Figure 1: Partial lattice shown for computing the subsequence
constrained entropy:
$\sum_Y p(Y_{-(3..4)}, y_3, y_4) \log p(Y_{-(3..4)}, y_3, y_4)$. Once
the complete $H^\alpha$ and $H^\beta$ lattices are constructed (in the
direction of the arrows), the entropy for each label sequence can be
computed in linear time.

Computing the $H^\alpha(\cdot)$ and $H^\beta(\cdot)$ lattices is easily
performed using the probabilities obtained by forward/backward. First
recall the decomposition formulas for entropy:

$$H(X, Y) = H(X) + H(Y|X)$$

$$H(Y|X) = \sum_x P(X = x)\, H(Y|X = x).$$

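Using this decomposition, we can define a dynamic program over the
entropy lattices similar to forward/backward:

$$\begin{aligned}
H^\alpha(Y_{1..i}|y_{i+1}, x)
&= H(Y_i|y_{i+1}, x) + H(Y_{1..(i-1)}|Y_i, y_{i+1}, x) \\
&= \sum_{y_i} p_\theta(y_i|y_{i+1}, x) \log p_\theta(y_i|y_{i+1}, x)
 + \sum_{y_i} p_\theta(y_i|y_{i+1}, x)\, H^\alpha(Y_{1..(i-1)}|y_i).
\end{aligned}$$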

The base case for the dynamic program is
$H^\alpha(\emptyset|y_1) = p(y_1) \log p(y_1)$. The backward entropy is
computed in a similar fashion. The conditional probabilities
$p_\theta(y_i|y_{i-1}, x)$ in each of these dynamic programs are
available by marginalizing over the per-transition marginal
probabilities obtained from forward/backward.
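The forward and backward entropy recursions can be sketched as follows, starting from the per-transition marginals that forward/backward provides. This is an illustration under stated assumptions rather than the authors' code: the names `entropy_lattices` and `pair_constrained_sums` are hypothetical, indices are 0-based, all marginals are assumed strictly positive, and the sketch treats an empty prefix or suffix as contributing zero while carrying the $p \log p$ term of the fixed pair separately (a bookkeeping choice of this example, which differs from the base case as stated in the text).

```python
import math

def entropy_lattices(marg):
    """marg[i][a][b] = p(Y_i=a, Y_{i+1}=b | x), for i = 0..n-2, all > 0.

    h_alpha[j][b]: sum of p log p over the prefix Y_0..Y_{j-1} given Y_j=b.
    h_beta[j][a]:  sum of p log p over the suffix Y_{j+1}..Y_{n-1} given Y_j=a.
    Both lattices cost O(n s^2).
    """
    m, s = len(marg), len(marg[0])
    n = m + 1
    # node[i][a] = p(Y_i=a | x), obtained by marginalizing the pair marginals.
    node = [[sum(marg[0][a]) for a in range(s)]]
    node += [[sum(marg[i][a][b] for a in range(s)) for b in range(s)]
             for i in range(m)]
    h_alpha = [[0.0] * s for _ in range(n)]  # empty prefix: 0 (assumption)
    h_beta = [[0.0] * s for _ in range(n)]   # empty suffix: 0 (assumption)
    for j in range(1, n):
        for b in range(s):
            for a in range(s):
                c = marg[j - 1][a][b] / node[j][b]  # p(Y_{j-1}=a | Y_j=b, x)
                h_alpha[j][b] += c * (math.log(c) + h_alpha[j - 1][a])
    for j in range(n - 2, -1, -1):
        for a in range(s):
            for b in range(s):
                c = marg[j][a][b] / node[j][a]      # p(Y_{j+1}=b | Y_j=a, x)
                h_beta[j][a] += c * (math.log(c) + h_beta[j + 1][b])
    return h_alpha, h_beta

def pair_constrained_sums(marg):
    """out[i][a][b] = sum over all other labels of p(Y|x) log p(Y|x),
    with Y_i=a and Y_{i+1}=b held fixed."""
    h_alpha, h_beta = entropy_lattices(marg)
    s = len(marg[0])
    return [[[marg[i][a][b] * (math.log(marg[i][a][b])
                               + h_alpha[i][a] + h_beta[i + 1][b])
              for b in range(s)] for a in range(s)]
            for i in range(len(marg))]
```

Summing `pair_constrained_sums(marg)[i]` over all label pairs at any single position `i` gives the negative entropy of the whole sequence, which is a useful internal consistency check.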

The computational complexity of this calculation for one label
sequence requires one run of forward/backward at $O(ns^2)$, and
equivalent time to calculate the lattices for $H^\alpha$ and
$H^\beta$. To calculate the gradient requires one final iteration over
all label pairs at each position, which is again time $O(ns^2)$, but
no greater, as forward/backward and the entropy calculations need only
to be done once. The complete asymptotic computational cost of
calculating the entropy gradient is $O(ns^2)$, which is the same time
as supervised training, and a factor of $O(ns)$ faster than the method
proposed by Jiao et al. (2006).
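The final pass over label pairs can be sketched as below. The function takes the pairwise marginals and the per-pair sums of $p \log p$ as already computed, and combines them with the local features into the two-term gradient. The name `entropy_gradient` and the dense nested-list layout are assumptions of this example; with dense features the loop also scales with the number of features, whereas a sparse feature representation would touch only the active features at each label pair.

```python
def entropy_gradient(feats, marg, pair_sums):
    """Gradient of the negative entropy with respect to each theta_k.

    feats[i][a][b][k] = f_k(x, Y_i=a, Y_{i+1}=b, i)   (hypothetical layout)
    marg[i][a][b]      = p(Y_i=a, Y_{i+1}=b | x)
    pair_sums[i][a][b] = sum over the other labels of p(Y|x) log p(Y|x)
    """
    m, s, K = len(feats), len(feats[0]), len(feats[0][0][0])
    # Negative entropy of the full sequence: sum the pair sums at one position.
    neg_entropy = sum(pair_sums[0][a][b] for a in range(s) for b in range(s))
    grad = []
    for k in range(K):
        first = 0.0     # sum_Y p log p F_k
        expected = 0.0  # sum_Y p F_k
        for i in range(m):
            for a in range(s):
                for b in range(s):
                    f = feats[i][a][b][k]
                    first += f * pair_sums[i][a][b]
                    expected += f * marg[i][a][b]
        grad.append(first - neg_entropy * expected)
    return grad
```

Because both inputs are computed once per sequence, this step adds a single sweep over the lattice rather than another dynamic program.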

Wall clock timing experiments show that this method takes
approximately 1.5 times as long as traditional supervised training,
less than the constant factors would suggest.[1] In practice, since
the three extra dynamic programs do not require recalculation of the
dot-product between parameters and input features (typically the most
expensive part of inference), they are significantly faster than
calculating the original forward/backward lattice.

5  Confidence  Estimation 

In addition to its merits for computing the entropy gradient,
subsequence constrained entropy has other uses, including confidence
estimation. Kim et al. (2006) propose using entropy as a confidence
estimator in active learning in CRFs, where examples with the most
uncertainty are selected for presentation to human labelers. In
practice, they approximate the entropy of the labels given the N-best
labels. Not only could our method quickly and exactly compute the true
entropy, but it could also be used to find the subsequence that has
the highest uncertainty, which could further reduce the additional
human tagging effort.

6  Related  Work 

Hernando et al. (2005) present a dynamic program for calculating the
entropy of an HMM, which has some loose similarities to the forward
pass of the algorithm proposed in this paper. Notably, our algorithm
allows for efficient calculation of entropy for any label subsequence.

Semi-supervised learning has been used in many models, predominantly
for classification, as opposed to structured output models like CRFs.
Zhu (2005) provides a comprehensive survey of popular semi-supervised
learning techniques.

[1] Reporting experimental results with accuracy is unnecessary since
we duplicate the training method of Jiao et al. (2006).

7  Conclusion 

This paper presents two algorithmic advances. First, it introduces an
efficient method for calculating subsequence constrained entropies in
linear-chain CRFs (useful for active learning). Second, it
demonstrates how these subsequence constrained entropies can be used
to efficiently calculate the gradient of the CRF entropy in time
$O(ns^2)$, the same asymptotic time complexity as the
forward/backward algorithm and an $O(ns)$ improvement over previous
algorithms, enabling the practical application of CRF entropy
regularization to large unlabeled data sets.